Introduction

TED (Technology, Entertainment, Design) is an organization that posts talks online for free distribution. Founded in 1984 as a conference, it has grown considerably since then and has become a symbol of the gathering of ideas. Even people who have barely heard of TED talks have probably noticed the slogan “ideas worth spreading” and the starry sky that opens each talk. Watching a TED talk after work or study can be inspiring, and every fan has his or her own criteria for rating talks across a wide range of topics. As a result, the talks attract many reviews and comments from viewers, which can serve as indicators of both the popularity of the talks and the interests of their viewers.

The members of our group have always been fascinated by TED talks and the diversity of their content (see the topic cloud below). We tend to focus on topics that interest us and to share our thoughts about them. This led to the idea of exploring what exactly affects the popularity of a particular TED talk, combining skills from the data science course with some strategies for manipulating the data.

Based on the dataset, this study explores the following points. First, the report gives a basic description of several notable variables among the 17 in the dataset and explains how they are processed. For instance, the topic (or theme) of a talk can be extracted from the “tag” variable, which lays a foundation for investigating the association between the topic and the popularity of a talk. The development of TED over time can also be presented in this way. Second, the report examines the connection between the popularity of a TED talk and other variables. The main question of this study is what makes a TED talk popular.

Methods

In this study, we follow a step-by-step analysis to reveal the association between popularity and other variables. The popularity of a TED talk is represented by its number of views (the “views” variable). Most of the analyses involve string operations on the dataset.

  1. The topic (or theme) of a TED talk is the first variable whose association with popularity is evaluated, since viewers are likely to prefer talks on topics they like. The topics of each talk are extracted from its “tag” variable and then summarized across talks. For talks tagged with multiple topics, all topics are kept, since a single talk can realistically belong to several themes. Finally, the topics are ranked by frequency, with tables and figures showing the 10 topics with the most talks, and the change in the favorite topics over time is plotted.

  2. To explore the connection between the popularity of a TED talk and its speaker, we identify the ten speaker occupations with the most views, using strategies similar to those in the topic analysis. In this part, we realized that some occupation terms have almost the same meaning, e.g. “author” and “writer”, both of which appear very frequently in the dataset. We therefore unify these two occupations as “writer”. A speaker may also have more than one occupation; we treat the first one listed as the main occupation, on the assumption that it best represents the speaker’s main focus. The “speaker_occupation” variable is therefore cleaned so that only the first occupation remains. The association between the popularity of TED talks (number of views) and the occupation of the speakers is presented using a boxplot.

  3. The “ratings” variable contains rich information about viewers’ reviews of each TED talk. Each talk may receive different types of reviews with varying counts, which makes a sentiment analysis of the ratings worthwhile. In this study, “ratings” is first split on certain symbols to extract each type of review for each talk. The sentiment of the talk is then calculated from the polarity of each review word, obtained by joining with the “bing” lexicon, and the corresponding counts. In addition, given the specific content of the “ratings” variable, some positive words, such as “fascinating”, “convincing” or “interesting”, may also reflect the popularity of a talk. We therefore compare the 10 talks with the most views (the most popular) with the 10 talks with the largest positive sentiment, to check how well they match and to assess whether sentiment really reflects the popularity of TED talks.

  4. After the foregoing analysis, we take into consideration other covariates that have not yet been involved. We explore the linear relationship between the number of views for each talk and other predictors, including duration, number of languages translated, year, number of speakers and sentiment rating, to show which factors may affect popularity. Here, “year” is categorized into three levels, “before 2010” (2006-2009), “between 2010 and 2015” (2010-2014) and “after 2015” (2015-2017), and the first level serves as the reference.
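
The string handling in steps 1 and 2 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used for this report: it assumes that the tags of each talk are stored as a stringified list and that multiple occupations are separated by `/`, `,` or `;`. The sample rows and the helper names `parse_tags` and `main_occupation` are hypothetical.

```python
import ast
import re
from collections import Counter

# Hypothetical sample rows; the field layout is an assumption, not taken from the report.
talks = [
    {"tags": "['technology', 'design', 'culture']", "speaker_occupation": "Author/educator"},
    {"tags": "['technology', 'science']", "speaker_occupation": "Writer, activist"},
    {"tags": "['culture']", "speaker_occupation": "Psychologist"},
]


def parse_tags(raw):
    """Turn a stringified list such as "['a', 'b']" into ['a', 'b']."""
    return [tag.strip().lower() for tag in ast.literal_eval(raw)]


def main_occupation(raw):
    """Keep only the first listed occupation and merge 'author' into 'writer'."""
    first = re.split(r"[/,;]", raw)[0].strip().lower()
    return "writer" if first == "author" else first


# A talk with several tags counts once for each of its topics.
topic_counts = Counter(tag for talk in talks for tag in parse_tags(talk["tags"]))
occupation_counts = Counter(main_occupation(t["speaker_occupation"]) for t in talks)

print(topic_counts.most_common(2))
print(occupation_counts.most_common(1))
```

Counting every tag of a multi-topic talk matches the choice in step 1 to keep all topics, so the topic counts add up to more than the number of talks.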

Results and Analysis

1) The association between Topic and Popularity

In this section, we examine how different topics relate to the popularity of TED talks.

TED includes talks on 419 different topics. The figure above shows the 10 topics with the most talks. Technology is the most common topic, with 727 talks.

This figure shows the distribution of views for the top 10 topics. All of the distributions are heavily right-skewed, which indicates that a few talks are extremely popular. We ordered the distributions by median number of views. Among the 10 most common topics, culture and business have the highest median number of views. Although technology is the topic TED covers most often, audiences show more interest in talks related to culture or business.

The first figure above shows how many talks covered each of the top 10 topics in each year. The topic “TEDx” appears frequently in 2012. TEDx is a program that supports independent organizers who want to create a TED-like event in their own community. In 2012, the week-long TEDx Summit was held by the Doha Film Institute; this inaugural event gathered TEDx organizers from around the world for workshops, talks and cultural activities, which may explain the increase in TEDx talks in 2012. In 2016, we also see a peak in the number of talks across all topics. Several global events were held in 2016, including “TED 2016 Dream”, a conference about ideas held on February 15-19, 2016, in Vancouver, BC, Canada.

The second figure above shows the trend in the share of the top 10 topics over the years. Culture-related talks were the most viewed in 2006, when the first six TED Talks were posted online. However, talks on culture have since dipped, decreasing steadily since 2013. In contrast, the topics innovation and health have drawn more and more attention over the years.

2) The association between Speaker and Popularity

From the table above, we find that statistician Hans Rosling gave the most TED Talks of any speaker, 9 in total. A professor of global health at Sweden’s Karolinska Institute, he focuses his work on dispelling common myths about the so-called developing world, which (he points out) is no longer worlds away from the West. He is followed by biologist Juan Enriquez, who gave 7 talks. The top ten speakers each gave between 5 and 9 talks, which probably indicates that these speakers are very popular among viewers and give TED Talks frequently.

speaker_occupation number_talks
writer 107
artist 46
designer 45
journalist 40
entrepreneur 36
inventor 36
architect 33
psychologist 31
neuroscientist 30
physicist 26

Based on the table above, writers are the most common speakers at TED, with 107 writers giving talks. They are followed by artists, designers, journalists, entrepreneurs, inventors, architects, psychologists, neuroscientists and physicists. We are surprised to find that the top four occupations are all related to the arts, which suggests that people who work in the arts are more willing to give a TED Talk and that their talks might attract more viewers.

Because several extreme points compress the plot and make the distribution for each speaker occupation hard to see, we limit the range of the y-axis to make the plot clearer. From the resulting boxplot, we find that physicists have the lowest median number of views while psychologists have the highest, indicating that TED Talks given by psychologists tend to be more popular and attract more viewers than talks given by speakers with other occupations. In addition, every occupation has some extreme points. This is easy to understand: these talks may have been given by the most famous or authoritative people in the field, or their content may have been relevant to the hottest issues at the time.

3) Sentiment analysis

In the sentiment analysis, we extract the rating words and their corresponding counts for each talk, and label each word as ‘positive’ or ‘negative’ according to the Bing sentiment lexicon. We then calculate a sentiment score for each talk as the sum of the positive counts minus the sum of the negative counts. A plot then shows the distribution of the sentiment score. Since the original sentiment scores on the y-axis are highly skewed, we plot the cube root of the score to aid visualization.

Most TED talks have positive sentiment ratings, since only a small portion of the plot lies on the negative side of the y-axis. Furthermore, talks with a large number of views also receive high ratings: the yellow and green points, which indicate more views, mostly appear on the right side of the plot, where the sentiment scores are high.
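
The score and the cube-root transform used in this section can be sketched as follows. This is a minimal Python illustration rather than the code behind the plot. The polarity lookup is a hand-picked toy table standing in for the Bing lexicon, and the `name`/`count` layout of the ratings, as well as the helper names, are assumptions.

```python
import math

# Toy polarity lookup, for illustration only. The report labels words with the
# Bing lexicon; these entries are hand-picked assumptions, not that lexicon.
TOY_POLARITY = {
    "inspiring": "positive",
    "fascinating": "positive",
    "confusing": "negative",
    "unconvincing": "negative",
}


def sentiment_score(ratings):
    """Sum of positive counts minus sum of negative counts; unknown words are skipped."""
    score = 0
    for rating in ratings:
        polarity = TOY_POLARITY.get(rating["name"].lower())
        if polarity == "positive":
            score += rating["count"]
        elif polarity == "negative":
            score -= rating["count"]
    return score


def signed_cube_root(x):
    """Cube root that keeps the sign, so negative scores stay negative."""
    return math.copysign(abs(x) ** (1 / 3), x)


# Hypothetical ratings for one talk (the name and count fields are assumed).
ratings = [
    {"name": "Inspiring", "count": 900},
    {"name": "Fascinating", "count": 150},
    {"name": "Confusing", "count": 40},
    {"name": "Longwinded", "count": 10},
]

score = sentiment_score(ratings)
print(score, round(signed_cube_root(score), 2))
```

Unlike a logarithm, the cube root is defined for zero and for negative scores, which is one reason it suits a score that can fall on either side of zero.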

4) Linear model building

In this model, we set the level “before 2010” of the categorical variable “year” as the reference. In the initial linear model, only the estimated coefficient for the number of speakers (“num_speaker”) is not significant at the 0.05 level. We therefore conclude that there is no significant linear association between the outcome (number of views) and the number of speakers, adjusted for the other covariates in the model. We then drop the num_speaker variable and refit the model.

term                       estimate   std.error  statistic  p.value
(Intercept)                -1131852   164325     -6.888     7.111e-12
sentiment                  625.1      12.34      50.67      0
yearbetween 2010 and 2015  56237      80398      0.6995     0.4843
yearafter 2015             476150     93227      5.107      3.507e-07
duration                   369.4      91.73      4.028      5.799e-05
languages                  58326      3788       15.4       3.17e-51

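From the table above, the final model we conclude for the study is \[\begin{aligned} \text{Views} = {} & -1.13 \times 10^{6} + 625.1 \times \text{Sentiment} + 5.62 \times 10^{4} \times I\{\text{Year between 2010 and 2015}\} \\ & + 4.76 \times 10^{5} \times I\{\text{Year after 2015}\} + 369.4 \times \text{Duration} + 5.83 \times 10^{4} \times \text{Languages} \end{aligned}\]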

For the variable year, the level “after 2015” has a significant positive estimate, indicating that the mean number of views for TED talks published after 2015 is higher than for those published before 2010, adjusted for the other covariates. In other words, TED talks published after 2015 are more popular than those published before 2010. The level “between 2010 and 2015”, by contrast, does not differ significantly from the reference (p = 0.4843). Languages and duration are also strongly associated with the number of views. Specifically, the mean number of views increases by 58326 for each additional language in which the talk is available, adjusted for the other covariates. As expected, sentiment plays an important role in the model, which suggests that it reflects the popularity of TED talks to some extent. The adjusted R-squared is around 60%, meaning that about 60% of the variation in the number of views can be explained by these covariates. This suggests that the linear model fits the data reasonably well.
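
To make the interpretation of the coefficients concrete, the fitted equation can be evaluated directly. The following Python sketch copies the estimates from the coefficient table; the function name `predict_views` and the input values are hypothetical, and duration is assumed to be given in the same units as the dataset’s “duration” variable.

```python
# Coefficients copied from the coefficient table of the refitted model.
COEF = {
    "intercept": -1131852,
    "sentiment": 625.1,
    "duration": 369.4,
    "languages": 58326,
}

# Offsets for the publication period; "before 2010" is the reference level.
PERIOD_OFFSET = {
    "before 2010": 0,
    "between 2010 and 2015": 56237,
    "after 2015": 476150,
}


def predict_views(sentiment, period, duration, languages):
    """Evaluate the fitted linear equation for one (hypothetical) talk."""
    return (
        COEF["intercept"]
        + COEF["sentiment"] * sentiment
        + PERIOD_OFFSET[period]
        + COEF["duration"] * duration
        + COEF["languages"] * languages
    )


# Two hypothetical talks that differ only in the number of languages.
base = predict_views(sentiment=2000, period="after 2015", duration=900, languages=30)
more = predict_views(sentiment=2000, period="after 2015", duration=900, languages=31)

print(round(base))
print(round(more - base))  # difference equals the "languages" coefficient
```

Because the intercept is negative, the equation can return negative values for talks with low sentiment scores and few languages, so predictions at the low end of the covariate range should be read with caution.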

Conclusion

In this study, we address the question of why a TED talk becomes popular by analyzing the association between popularity, represented by the number of views, and other variables in the dataset. According to the results, the topic, the speaker, the duration of a TED talk and the number of languages available are all associated with the popularity of the talk. Several topics, such as technology, business and culture, tend to be more popular. The speakers and their occupations also play important roles in drawing attention; psychologists, writers, scientists and entrepreneurs are the most popular speaker occupations. The number of languages available for a TED talk also matters for its popularity, so making a talk available in more languages may help improve its popularity. Finally, the popularity of TED talks is compared over time, which shows that more and more people are joining this sharing of ideas and that the “ideas worth spreading” are truly propagating.